Overview

This document describes the results of the multi-omics non-negative matrix factorization (NMF)-based clustering module. For more information about the module, please visit the PANOPLY Wiki page.


Input data matrix

The data matrix subjected to NMF analysis contained 39785 features measured across 76 samples. Table 1 summarizes the number of features used in the clustering and their data type(s).

Table 1: Number of features used for clustering.
Type Number of features
CNV 14578
prot 7577
pSTY 3781
RNA 13849

Determining the number of clusters

To determine an optimal value k for the number of clusters, a range of k between 2 and 10 was evaluated using several metrics:

  • Cophenetic correlation coefficient (coph) measuring how well the intrinsic structure of the data is recapitulated after clustering.
  • Dispersion coefficient (disp) of the consensus matrix, as defined in Kim and Park, 2007, measuring the reproducibility of the clustering across 50 random iterations.
  • Silhouette score (sil) measuring how similar a sample is to its own cluster (cohesion) compared to other clusters (separation); it is defined for each sample. The average silhouette score across all samples is calculated for each cluster number k.

The metrics are summarized in Figure 1. The optimal number of clusters is defined as the value of k that maximizes the product of coph and disp between k=3 and k=10.

Figure 1: Cluster metrics as a function of cluster numbers.
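
The selection rule described above can be illustrated with a minimal Python sketch. It is not the module's own code: the function names `dispersion` and `select_k` and the metric values in the example are hypothetical. The dispersion coefficient follows the definition in Kim and Park, 2007, i.e. the mean of 4(C_ij - 1/2)^2 over all entries of the consensus matrix C, and the exclusion of k=2 is assumed to correspond to the `exclude_2` parameter in Table 4.

```python
def dispersion(consensus):
    """Dispersion coefficient of an n x n consensus matrix (Kim and Park, 2007).

    Equals 1.0 when every entry is 0 or 1 (perfectly reproducible clustering)
    and 0.0 when every entry is 0.5.
    """
    n = len(consensus)
    return sum(4.0 * (c - 0.5) ** 2 for row in consensus for c in row) / (n * n)


def select_k(coph, disp, exclude_2=True):
    """Return the k that maximizes coph[k] * disp[k].

    coph and disp are dicts keyed by k; k=2 is skipped when exclude_2 is True.
    """
    products = {
        k: coph[k] * disp[k]
        for k in coph
        if not (exclude_2 and k == 2)
    }
    return max(products, key=products.get)


# Hypothetical metric values, for illustration only (not taken from this report).
example_coph = {2: 0.99, 3: 0.98, 4: 0.95}
example_disp = {2: 0.95, 3: 0.90, 4: 0.80}
best_k = select_k(example_coph, example_disp)
```

Multiplying the two metrics favours values of k for which the clustering is both faithful to the data structure and reproducible across runs.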


Clustering results

The 76 samples were separated into 3 clusters. Table 2 summarizes the number of samples in each cluster.

Table 2: Cluster composition. The cluster core is defined as samples with a cluster membership score > 0.5
Cluster # samples # core samples
C1 21 17
C2 26 24
C3 29 23

Sample coefficient matrix

The heatmap shown in Figure 2 is a visualization of the meta-feature matrix derived from decomposing the input matrix, normalized per column by the maximum entry. The matrix presents one of the main results of NMF as it provides the basis of assigning samples to clusters.

Figure 2: Heatmap depicting the relative contributions of each sample (x-axis) to each cluster (y-axis). Samples are ordered by cluster and cluster membership score in decreasing order.
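
As a rough illustration of the normalization used for the heatmap, the sketch below scales each column (sample) of a small coefficient matrix by its maximum entry and assigns each sample to the cluster with the largest coefficient. The matrix values and function names are hypothetical, and the largest-coefficient assignment rule is an assumption (it is the common convention for NMF clustering), not a statement of how the module computes membership scores.

```python
def normalize_columns_by_max(H):
    """Divide every column of H (clusters x samples) by its maximum entry."""
    n_rows, n_cols = len(H), len(H[0])
    col_max = [max(H[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [H[r][c] / col_max[c] if col_max[c] > 0 else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


def assign_clusters(H):
    """Assign each sample (column) to the 1-based row index with the largest value."""
    n_rows, n_cols = len(H), len(H[0])
    return [max(range(n_rows), key=lambda r: H[r][c]) + 1 for c in range(n_cols)]


# Hypothetical 3-cluster x 3-sample coefficient matrix.
H_example = [
    [2.0, 0.1, 0.3],
    [1.0, 0.4, 0.2],
    [0.5, 0.2, 0.9],
]
H_norm = normalize_columns_by_max(H_example)
clusters = assign_clusters(H_example)
```

After normalization the dominant cluster of every sample has the value 1, which makes columns comparable in the heatmap regardless of their absolute scale.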


Overrepresentation analysis

Table 3 summarizes the results of an overrepresentation analysis of sample metadata terms (e.g. clinical annotation, inferred phenotypes, etc.) in each cluster. Shown are nominal p-values derived from Fisher’s exact test (p<0.01, 0.01<p<0.02, 0.02<p<0.05). All samples with a cluster membership score > 0.5 were used to characterize the clusters.

Table 3: Overrepresentation analysis of sample metadata terms in each cluster.
C1 C2 C3
PAM50:Basal 1.0000000 0.0000000 1.0000000
PAM50:LumB 1.0000000 0.9999685 0.0000000
PAM50:LumA 0.0000001 0.9999986 0.9079225
ER.Status:Negative 0.9998154 0.0000000 1.0000000
ER.Status:Positive 0.0028403 1.0000000 0.0000030
PR.Status:Negative 0.9973536 0.0000000 0.9999982
PR.Status:Positive 0.0155967 1.0000000 0.0000326
TP53.mutation:1 0.9997822 0.0000027 0.9819766
TP53.mutation:0 0.0019130 0.9999999 0.0585794
PIK3CA.mutation:0 0.9590769 0.0294839 0.8605083
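
To make the test behind Table 3 concrete, the following sketch computes a one-sided (enrichment) Fisher's exact test p-value from the hypergeometric distribution. The one-sided alternative is an assumption suggested by the complementary values near 0 and 1 in the table; the function name and the counts in the example are hypothetical and are not taken from this analysis.

```python
from math import comb


def fisher_enrichment_p(in_cluster_with_term, cluster_size, total_with_term, total):
    """One-sided Fisher's exact test: P(X >= in_cluster_with_term).

    X is the number of samples carrying the term among cluster_size samples
    drawn without replacement from total samples, total_with_term of which
    carry the term.
    """
    upper = min(cluster_size, total_with_term)
    favourable = sum(
        comb(total_with_term, x) * comb(total - total_with_term, cluster_size - x)
        for x in range(in_cluster_with_term, upper + 1)
    )
    return favourable / comb(total, cluster_size)


# Hypothetical counts: 10 samples, 5 carry the term, and all 5 fall into a cluster of size 5.
p_example = fisher_enrichment_p(5, 5, 5, 10)
```

A small p-value indicates that the term occurs in the cluster more often than expected by chance; a value close to 1 indicates that it does not.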

Cluster-specific features

Matrix W, containing the weights of each feature in a certain cluster, was used to derive a list of r representative features separating the clusters using the method proposed in (Kim and Park, 2007). To derive a p-value for each cluster-specific feature, a 2-sample moderated t-test (Ritchie et al., 2015) was used to compare the abundance of the features between the respective cluster and all other clusters. Derived p-values were adjusted for multiple hypothesis testing using the method proposed in (Benjamini and Hochberg, 1995). Features with FDR < 0.01 (parameter feature_fdr in Table 4) are used in subsequent analyses.

Figure 3: Heatmap depicting abundances of cluster-specific features defined as described above. Samples are ordered by cluster and cluster membership score in decreasing order.
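
The multiple-testing adjustment mentioned above (Benjamini and Hochberg, 1995) can be sketched in a few lines of Python. This is an illustrative re-implementation, not the module's code; the function name and the p-values in the example are hypothetical. The moderated t-test itself is not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
selected = [i for i, q in enumerate(fdr) if q < 0.05]
```

Each adjusted value is the smallest false discovery rate at which the corresponding feature would still be called significant.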



In total, 330 features separating the clusters have been detected using the method described above. The distribution of features across the different clusters is shown in Figure 4.

Figure 4: Bar chart depicting the number of cluster-specific features.


The data table below depicts all cluster-specific features. The table is interactive and can be sorted and filtered. Please note that the table represents a condensed version of the entire table, which can be found in the Excel sheet NMF_features_N_330.xlsx.


Cluster stability

Consensus matrix

The entries in the sample-by-sample matrix shown in Figure 5 depict the relative frequencies with which two samples were assigned to the same cluster across 50 iterations.

Figure 5: Consensus matrix derived from 50 randomly initialized iterations.
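
The construction of such a consensus matrix can be illustrated with the short sketch below. It assumes that each NMF run yields one cluster label per sample; the function name and the two example runs are hypothetical.

```python
def consensus_matrix(runs):
    """Fraction of runs in which each pair of samples shares a cluster.

    runs is a list of label lists, one list per run, all of equal length.
    """
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


# Two hypothetical runs over three samples.
C = consensus_matrix([[1, 1, 2], [1, 2, 2]])
```

Entries close to 0 or 1 indicate pairs of samples that are separated or grouped consistently, whereas intermediate values point to unstable assignments.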


Silhouette plot

Silhouette scores indicate how similar a sample is to its own cluster compared to other clusters. The silhouette plot shown in Figure 6 depicts the consistency of the derived clusters. A negative silhouette score marks a sample as an outlier in its assigned cluster.

Figure 6: Silhouette plot illustrating the silhouette score (x-axis) for each sample (y-axis) grouped by each cluster (K=3). Number of samples and average silhouette scores per cluster are shown on the right side.
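
For reference, a per-sample silhouette score can be computed as sketched below: s = (b - a) / max(a, b), where a is the mean distance of a sample to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The use of Euclidean distance on raw coordinates is an assumption made for this illustration, and the module may compute distances on a different representation of the data. The function name and the example points are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette score per sample using Euclidean distance."""
    n = len(points)
    scores = []
    for i in range(n):
        # Collect distances from sample i to all other samples, grouped by cluster.
        by_cluster = {}
        for j in range(n):
            if i != j:
                d = math.dist(points[i], points[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # convention for singleton or single-cluster cases
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical 2-D points; the third point lies near cluster 0 but is labelled 1.
pts = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.5), (10.0, 0.0), (10.0, 1.0)]
sil = silhouette_scores(pts, [0, 0, 1, 1, 1])
```

In this example the third sample receives a negative score because it is closer to the members of cluster 0 than to those of the cluster it was assigned to.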


Parameters

Details about the parameters listed in Table 4 can be found in the PANOPLY Wiki.

Table 4: List of parameters used in panoply_mo_nmf.
param value
kmin 2
kmax 10
exclude_2 TRUE
core_membership 0.5
nrun 50
seed random
method brunet
bnmf FALSE
feature_fdr 0.01
ora_pval 0.01
ora_max_categories 10
hm_cw 5
hm_ch 8
hm_max_val 10
hm_max_val_z 4
filt_mode global
sd_filt 0.05
z_score TRUE
impute FALSE
impute_k 5
max_na_row 0.3
max_na_col 0.9
gene_col geneSymbol
nmf_only FALSE
organism human
tar_file /cromwell_root/fc-3c36c89e-bca9-4372-b58b-2a820ecb71ef/44eb133e-5a7b-4197-9ac0-f3d01be82d04/panoply_unified_workflow/4f1cfe26-53b7-4522-ad32-8082485723c9/call-nmf/mo_nmf_wdl.panoply_mo_nmf_gct_workflow/da0d8709-69d1-40f2-887f-aa617216d795/call-panoply_mo_nmf_pre/cacheCopy/all.tar
lib_dir /home/pgdac/src/
yaml_file /cromwell_root/fc-3c36c89e-bca9-4372-b58b-2a820ecb71ef/panoply-parameters.yaml
help FALSE
cat_anno PAM50;ER.Status;PR.Status;HER2.Status;TP53.mutation;PIK3CA.mutation;GATA3.mutation
cont_anno NA
cat_colors PAM50=Her2:#F9BFCB;Basal:#EE2025;LumB:#ADDAE8;LumA:#3953A5|ER.Status=Negative:#FFFFFF;Positive:#000000|PR.Status=Negative:#FFFFFF;Positive:#000000|HER2.Status=Positive:#000000;Negative:#FFFFFF;Equivocal:#808080|TP53.mutation=1:#000000;0:#FFFFFF|PIK3CA.mutation=0:#FFFFFF;1:#000000|GATA3.mutation=0:#FFFFFF;1:#000000
blank_anno N/A
blank_anno_col white



Created on 2020-11-13 05:06:57